KV handoff with DMA slicing APIs to avoid KV input/output copies. by quic-akuruvil · Pull Request #1039 · quic/efficient-transformers

quic-akuruvil · 2026-06-04T17:14:44Z

Problem

Without DMA slicing, the QPC in disaggregated serving expects the KV cache for every batch slot as input. For example, if decode runs with BS=32 and 4 of those slots become free, the QPC and LRT still expect KV caches for all 32 slots. DMA buffer slicing addresses this: the user can slice the DMA buffer into N batch slots and write the KV cache for an individual slot by indexing it.
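The per-slot indexing described above can be illustrated with a plain NumPy array standing in for the DMA buffer. This is only a sketch of the addressing idea, not the QAIC runtime API: the shapes, `slot_byte_range`, and `write_slot` are hypothetical names chosen for illustration.

```python
import numpy as np

# Hypothetical shapes, chosen only for illustration.
DECODE_BATCH = 32
NUM_KV_HEADS, CTX_LEN, HEAD_DIM = 2, 16, 8

# One contiguous array standing in for the DMA buffer of a single KV input.
kv_buffer = np.zeros((DECODE_BATCH, NUM_KV_HEADS, CTX_LEN, HEAD_DIM), dtype=np.float16)


def slot_byte_range(buffer, slot):
    """Return (offset, size) in bytes of one batch slot inside the buffer."""
    slot_bytes = buffer[0].nbytes
    return slot * slot_bytes, slot_bytes


def write_slot(buffer, slot, kv):
    """Overwrite a single batch slot in place; the other slots are untouched."""
    buffer[slot] = kv


# A freed slot (here slot 5) receives the KV cache of a newly prefilled request.
new_kv = np.ones((NUM_KV_HEADS, CTX_LEN, HEAD_DIM), dtype=np.float16)
write_slot(kv_buffer, 5, new_kv)
offset, size = slot_byte_range(kv_buffer, 5)
```

Only the bytes in `[offset, offset + size)` change, which is why the remaining 31 slots do not need to be supplied again.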

Idea

Disaggregated serving pipeline on QAIC with zero-copy KV cache handoff.
Prefill-to-decode KV transfer goes through the host (shared memory).
Shared memory is used so that the KV cache is not copied when it is transferred from prefill to the host.
The KV cache is dumped from the prefill devices into shared memory on the host, and a pointer to that shared memory is passed to the decode instance, which loads the KV cache directly from those host buffers.
This can be useful in a disaggregated setting with any large KV footprint. Because DMA buffer slicing is used, KV caches do not have to be passed as inputs between the prefill and decode sessions.
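As a rough host-side analogy for the shared-memory handoff, the sketch below uses Python's standard `multiprocessing.shared_memory` module. It does not use the QAIC runtime; `publish_kv`, `read_kv_checksum`, and the KV shape are hypothetical, and the initial write merely stands in for the prefill device dumping its KV output to host memory.

```python
import numpy as np
from multiprocessing import shared_memory

SHAPE, DTYPE = (2, 16, 8), np.float16  # hypothetical KV shape for one request


def publish_kv(kv):
    """Prefill side: place KV in a named shared-memory block and return its handle."""
    shm = shared_memory.SharedMemory(create=True, size=kv.nbytes)
    view = np.ndarray(kv.shape, dtype=kv.dtype, buffer=shm.buf)
    view[:] = kv  # stands in for the device writing its KV output to host memory
    del view      # drop the buffer export so the handle can be closed later
    return shm


def read_kv_checksum(name):
    """Decode side: attach by name and read the same bytes through a view."""
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)  # a view, not a copy
    checksum = float(view.astype(np.float32).sum())
    del view
    shm.close()
    return checksum


kv = np.full(SHAPE, 0.5, dtype=DTYPE)
producer = publish_kv(kv)
checksum = read_kv_checksum(producer.name)  # only the block name is handed over
producer.close()
producer.unlink()
```

The consumer receives just a name (the analogue of the pointer mentioned above) and maps the same memory, so the KV bytes themselves are never duplicated on the host.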

Optimization

Adds a new temporary QAICInferenceSession class (cloud_infer_kv_slice.py) that enables zero-copy KV-cache handoff between disaggregated prefill and decode sessions using shared DMA buffers and the QAICRT API setDataWithSlices(). On the last prefill chunk, KV outputs are wired directly into the decode session's input slots via a sliced DMA descriptor, which eliminates the Python/numpy copy at the prefill→decode boundary.

cluster_id="prefill" gives a pool of stages+1 slots for concurrent chunk pipelining; cluster_id="decode" gives a single fixed slot because decode is strictly sequential.

Enables true prefill/decode overlap (exec-obj pool)

Existing method: uses a single QAICInferenceSession with one exec-obj. The CPU must call waitForCompletion() (blocking) before it can read KV outputs and set up the next call. Prefill and decode are strictly serialized.

KV slice method: uses separate cluster_id="prefill" and cluster_id="decode" sessions with exec-obj pools. setDataWithSlices is called before enqueue, so the runtime knows where to write KV outputs before inference starts. This means:

A new prefill request can be enqueued on a free prefill exec-obj while decode is still running on its exec-obj
The prefill pool (stages+1 exec-objs) allows pipelined chunked prefill without stalling on waitForCompletion() between chunks
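The pool behaviour described above can be modelled with a small toy class. This is a sketch under stated assumptions, not the implementation in cloud_infer_kv_slice.py: `ExecObjPool`, `try_enqueue`, and `complete_oldest` are hypothetical names, and only the sizing rule (stages+1 for prefill, 1 for decode) comes from the description.

```python
from collections import deque


class ExecObjPool:
    """Toy stand-in for a pool of execution objects (not the QAIC runtime)."""

    def __init__(self, cluster_id, stages=1):
        # Sizing rule from the description: stages+1 for prefill, 1 for decode.
        size = stages + 1 if cluster_id == "prefill" else 1
        self.cluster_id = cluster_id
        self._free = deque(range(size))
        self.in_flight = []

    def try_enqueue(self, work):
        """Enqueue on a free exec-obj; return its id, or None if all are busy."""
        if not self._free:
            return None
        obj_id = self._free.popleft()
        self.in_flight.append((obj_id, work))
        return obj_id

    def complete_oldest(self):
        """Finish the oldest in-flight item and return its exec-obj to the pool."""
        obj_id, work = self.in_flight.pop(0)
        self._free.append(obj_id)
        return work


prefill = ExecObjPool("prefill", stages=2)
decode = ExecObjPool("decode")

# Decode occupies its single exec-obj while prefill chunks are still accepted.
decode_slot = decode.try_enqueue("decode step")
chunk_slots = [prefill.try_enqueue(f"chunk {i}") for i in range(4)]

# The fourth chunk had to wait; it fits once the oldest chunk completes.
finished = prefill.complete_oldest()
retry_slot = prefill.try_enqueue("chunk 3")
```

With stages=2, three chunks are in flight at once while decode runs on its own exec-obj; only the fourth chunk has to wait for a completion.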

Sample Example Script

Also adds an end-to-end example (examples/disagg_serving/qwen3moe_disagg_mode_with_chunking_kvslice.py) demonstrating the full disaggregated serving flow for Qwen3-MoE with chunked prefill, PP (stages), TS, and DMA-sliced KV handoff.

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

This PR adds MDP generation required for disaggregated serving for Prefill. Supports both Pipeline Prefill + Tensor Slicing and passing custom cores to the MDP generator. Also adds support for VLMs, compiler 'stages' option, and layerwise export. Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

quic-akuruvil requested review from anujgupt-github, quic-hemagnih, quic-rishinr and vbaddi June 4, 2026 17:16

quic-akuruvil assigned ochougul and quic-akuruvil and unassigned ochougul Jun 4, 2026

quic-akuruvil requested a review from ochougul June 4, 2026 17:20

quic-akuruvil and others added 4 commits June 9, 2026 15:05

Added inference serving with DMA slicing for KV handoff

13ed883

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

Added 2L script temporarily for quick testing

3da9719

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

Added sample script for VLM disaggregated with KV slice

1a4c5d5

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

quic-akuruvil force-pushed the dma_slice branch from fe974d0 to 1a4c5d5 Compare June 10, 2026 10:10

quic-akuruvil wants to merge 4 commits into quic:release/v1.22.0_tmp from quic-akuruvil:dma_slice
